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Abstract 


Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional 
problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaus- 
sian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend 
linear models by replacing linear functions with a set of one-dimensional smooth functions, the 
nonparanormal extends the normal by transforming the variables by smooth functions. We derive a 
method for estimating the nonparanormal, study the method’s theoretical properties, and show that 
it works well in many examples. 


Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, £, regular- 
ization, graphical lasso, paranormal, occult 


1. Introduction 


The linear model is a mainstay of statistical inference that has been extended in several important 
ways. An extension to high dimensions was achieved by adding a sparsity constraint, leading to the 
lasso (Tibshirani, 1996). An extension to nonparametric models was achieved by replacing linear 
functions with smooth functions, leading to additive models (Hastie and Tibshirani, 1999). These 
two ideas were recently combined, leading to an extension called sparse additive models (SpAM) 
(Ravikumar et al., 2008, 2009a). In this paper we consider a similar nonparametric extension of 
undirected graphical models based on multivariate Gaussian distributions in the high dimensional 
setting. Specifically, we use a high dimensional Gaussian copula with nonparametric marginals, 
which we refer to as a nonparanormal distribution. 

If X is a p-dimensional random vector distributed according to a multivariate Gaussian distribu- 
tion with covariance matrix X, the conditional independence relations between the random variables 
X1, X2,..., Xy are encoded in a graph formed from the precision matrix Q = X-!. Specifically, miss- 
ing edges in the graph correspond to zeroes of Q. To estimate the graph from a sample of size n, it 
is only necessary to estimate X, which is easy if n is much larger than p. However, when p is larger 
than n, the problem is more challenging. Recent work has focused on the problem of estimating the 
graph in this high dimensional setting, which becomes feasible if G is sparse. Yuan and Lin (2007) 
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Figure 1: Comparison of regression and graphical models. The nonparanormal extends additive 
models to the graphical model setting. Regularizing the inverse covariance leads to an 
extension to high dimensions, which parallels sparse additive models for regression. 


and Banerjee et al. (2008) propose an estimator based on regularized maximum likelihood using an 
£; constraint on the entries of Q, and Friedman et al. (2007) develop an efficient algorithm for com- 
puting the estimator using a graphical version of the lasso. The resulting estimation procedure has 
excellent theoretical properties, as shown recently by Rothman et al. (2008) and Ravikumar et al. 
(2009b). 


While Gaussian graphical models can be useful, a reliance on exact normality is limiting. Our 
goal in this paper is to weaken this assumption. Our approach parallels the ideas behind sparse 
additive models for regression (Ravikumar et al., 2008, 2009a). Specifically, we replace the Gaus- 
sian with a semiparametric Gaussian copula. This means that we replace the random variable 
X = (Xi,..., Xy) by the transformed random variable f (X) = (f1(X1),---;fp(Xp)), and assume that 
f (X) is multivariate Gaussian. This semiparametric copula results in a nonparametric extension of 
the normal that we call the nonparanormal distribution. The nonparanormal depends on the func- 
tions { f;}, and a mean u and covariance matrix È, all of which are to be estimated from data. While 
the resulting family of distributions is much richer than the standard parametric normal (the para- 
normal), the independence relations among the variables are still encoded in the precision matrix 
Q — X. We propose a nonparametric estimator for the functions 1 f j}, and show how the graphical 
lasso can be used to estimate the graph in the high dimensional setting. The relationship between 
linear regression models, Gaussian graphical models, and their extensions to nonparametric and 
high dimensional models is summarized in Figure 1. 


Most theoretical results on semiparametric copulas focus on low or at least finite dimensional 
models (Klaassen and Wellner, 1997; Tsukahara, 2005). Models with increasing dimension require 
a more delicate analysis; in particular, simply plugging in the usual empirical distribution of the 
marginals does not lead to accurate inference. Instead we use a truncated empirical distribution. We 
give a theoretical analysis of this estimator, proving consistency results with respect to risk, model 
selection, and estimation of © in the Frobenius norm. 


In the following section we review the basic notion of the graph corresponding to a multivariate 
Gaussian, and formulate different criteria for evaluating estimators of the covariance or inverse 
covariance. In Section 3 we present the nonparanormal, and in Section 4 we discuss estimation of 
the model. We present a theoretical analysis of the estimation method in Section 5, with the detailed 
proofs collected in an appendix. In Section 6 we present experiments with both simulated data and 
gene microarray data, where the problem is to construct the isoprenoid biosynthetic pathway. 
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2. Estimating Undirected Graphs 


Let X = (Xi,..., X) denote a random vector with distribution P = N(u,X). The undirected graph 
G — (V, E) corresponding to P consists of a vertex set V and an edge set E. The set V has p elements, 
one for each component of X. The edge set E consists of ordered pairs (i, j) where (i, j) € E if there 
is an edge between X; and X;. The edge between (i, j) is excluded from E if and only if X; is 
independent of X; given the other variables X (;  — (Xs: 1€ s €p, sz ij), written 


Xi IL Xj Xing. (D) 


It is well known that, for multivariate Gaussian distributions, (1) holds if and only if Q;; — 0 where 
i» 

Let xD xO, hee XO be a random sample from P, where X c gr. If n is much larger than p, 
then we can estimate X using maximum likelihood, leading to the estimate Q = S~!, where 


s= MEG) (x2) 


is the sample covariance, with X the sample mean. The zeroes of Q can then be estimated by 
applying hypothesis testing to Q (Drton and Perlman, 2007, 2008). 

When p > n, maximum likelihood is no longer useful; in particular, the estimate X is not positive 
definite, having rank no greater than n. Inspired by the success of the lasso for linear models, several 
authors have suggested estimating X by minimizing 


—£(Q) +24) | | 


jtk 
where ; 
&Q) = 7 (log |Q] — tr(QS) — plog(2n)) 


is the log-likelihood with S the sample covariance matrix. The estimator Q can be computed ef- 
ficiently using the glasso algorithm (Friedman et al., 2007), which is a block coordinate descent 
algorithm that uses the standard lasso to estimate a single row and column of Q in each iteration. 
Under appropriate sparsity conditions, the resulting estimator Q has been shown to have good the- 
oretical properties (Rothman et al., 2008; Ravikumar et al., 2009b). 

There are several different ways to judge the quality of an estimator $ of the covariance or 
Q of the inverse covariance. We discuss three in this paper, persistency, norm consistency, and 
sparsistency. Persistency means consistency in risk, when the model is not necessarily assumed 
to be correct. Suppose the true distribution P has mean jo, and that we use a multivariate normal 
p(x;uo,X) for prediction; we do not assume that P is normal. We observe a new vector X ~ P and 
define the prediction risk to be 


R(X) = —Elog p(X;uo,X) = - [opos E) dP (x). 


It follows that 
(tr(Z7! £o) + log |Z| — plog(2x)) 
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where Xo is the covariance of X under P. If S is a set of covariance matrices, the oracle is defined 
to be the covariance matrix L, that minimizes R(X) over $: 


Y, = arg min; ,R(X). 


Thus p(x; uo, X.) is the best predictor of a new observation among all distributions in (p(x;uo, X) : 
Y c S}. In particular, if § consists of covariance matrices with sparse graphs, then p(x;uo, X.) is, in 
some sense, the best sparse predictor. An estimator X, is persistent if 


R(à,)—R(X.)0 


as the sample size n increases to infinity. Thus, a persistent estimator approximates the best estima- 
tor over the class S, but we do not assume that the true distribution has a covariance matrix in $, or 
even that it is Gaussian. Moreover, we allow the dimension p — p, to increase with n. On the other 
hand, norm consistency and sparsistency require that the true distribution is Gaussian. In this case, 
let Xo denote the true covariance matrix. An estimator is norm consistent if 


|£ -zj ^o 
where ||- || is a norm. If E(€2) denotes the edge set corresponding to Q, an estimator is sparsistent if 
P(E(Q) + E®,)) 0. 


Thus, a sparsistent estimator identifies the correct graph consistently. We present our theoretical 
analysis on these properties of the nonparanormal in Section 5. 


3. The Nonparanormal 


We say that a random vector X = (X1,..., X5)! has a nonparanormal distribution if there exist 
functions {f} - such that Z = f (X) ~ N(u, E), where f(X) = (fi(Xi),..., fp(Xp)). We then write 


X ~ NPN (p, £, f). 


When the f;’s are monotone and differentiable, the joint probability density function of X is given 
by 





Pus d exp{ Dor uz (6) - Tuto! Q) 
( p/2y/2 2 H jj 


2x) 


Lemma 1 The nonparanormal distribution NPN (u,X, f) is a Gaussian copula when the f;'s are 
monotone and differentiable. 


Proof By Sklar's theorem (Sklar, 1959), any joint distribution can be written as 
FO. p Xp) = Cina), E Fp(xp)) 
where the function C is called a copula. For the nonparanormal we have 


F(x,. cis Xp) = o, (^ | (Fi (x1)), dus S | (F,(xp))) 
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where ®,,y is the multivariate Gaussian cdf and ® is the univariate standard Gaussian cdf. Thus, 
the corresponding copula is 


C(ui,..., up) = Oux( I u),..., P7! (up)). 


This is exactly a Gaussian copula with parameters u and X. If each f; is differentiable then the 
density of X has the same form as (2). H 


Note that the density in (2) is not identifiable; to make the family identifiable we demand that 
fj preserve means and variances: 


uj = E(Zj) =E(X;) and oj = Ej; = Var (Zj) = Var (X;). (3) 


Note that these conditions only depend on diag(X) but not the full covariance matrix. 
Let F;(x) denote the marginal distribution function of X;. Then 


Fj(x) = P(X; <x) 2 P(Z; < fj()) 2 (45-4) 


which implies that 
fi) 2 uj toj! (Fj(x)). (4) 


The following basic fact says that the independence graph of the nonparanormal is encoded in 
Q = Xl, as for the parametric normal. 


Lemma 2 /f X ~ NPN (u,X, f) is nonparanormal and each fj is differentiable, then X; 1L X; [Xr jy 
if and only if Q;; — 0, where Q = X 


Proof From the form of the density (2), it follows that the density factors with respect to the graph 
of Q, and therefore obeys the global Markov property of the graph. a 


Next we show that the above is true for any choice of identification restrictions. 


Lemma 3 Define 
hj(x) = ^ (Fi) (5) 


and let A be the covariance matrix of h(X). Then Xj 1L Xx|X\{j,44 if and only if Asl =0. 
Proof We can rewrite the covariance matrix as 
Lik = Cov(Zj, Zi) = © jOxCov(h j(X;), he (Xx))- 


Hence X = DAD and 





y!2p!Adp! 


where D is the diagonal matrix with diag(D) = c. The zero pattern of A7! is therefore identical to 
the zero pattern of X-!. a 
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Figure 2: Densities of three 2-dimensional nonparanormals. The component functions have the 
form f;(x) = sign(x)|x|%. Left: o = 0.9, o? = 0.8; center: o; = 1.2, o» = 0.8; right 
OL; = 2, 05 = 3. In each case u = (0,0) and X = (7): 


Thus, it is not necessary to estimate u or © to estimate the graph. 
Figure 2 shows three examples of 2-dimensional nonparanormal densities. In each case, the 
component functions f;(x) take the form 


fj(X) = ajsign(x)|xl" +b; 


where the constants a; and b; are set to enforce the identifiability constraints (3). The covariance in 
each case is  — i) and the mean is u = (0,0). The exponent o; determines the nonlinearity. It 
can be seen how the concavity of the density changes with the exponent c, and that œ > 1 can result 
in multiple modes. 

The assumption that f(X) = (f1(X1),---,fp(Xp) is normal leads to a semiparametric model 
where only one dimensional functions need to be estimated. But the monotonicity of the functions 
fj, which map onto R, enables computational tractability of the nonparanormal. For more general 
functions f, the normalizing constant for the density 


px) exp | —5 G6 iu £1 (06 40] 


cannot be computed in closed form. 
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4. Estimation Method 
Let X(U,..., X) be a sample of size n where X® = (xl. f X(t € R”. In light of (5) we define 


hj(x) = D(F; (x) 


where F; is an estimator of F;. A natural candidate for F; is the marginal empirical distribution 
function 


Now, let O denote the parameters of the copula. Tsukahara (2005) suggests taking @ to be the 
solution of 
Vo (Fix). Fey 

0 (Ax Joss (XS ),9) =0 
=] 


l 


where ọ is an estimating equation and F;(t) = nF;(t) /(n+ 1). In our case, 0 corresponds to the 
covariance matrix. The resulting estimator 6, called a rank approximate Z-estimator, has excellent 
theoretical properties. However, we are interested in the high dimensional scenario where the di- 
mension p is allowed to increase with n; the variance of F(t) is too large in this case. Instead, we 


use the following truncated or Winsorized! estimator: 


8, if F(x) < 8, 
F(x)=¢ F(x) ifin < F(x) €1—8, (6) 
(1-8,) if F(x) > 1-6, 


where à, is a truncation parameter. Clearly, there is a bias-variance tradeoff in choosing 5,. Essen- 
tially the same estimator with 6, = 1/n is studied by Klaassen and Wellner (1997) in the case of 
bivariate Gaussian copula. In what follows we use 


1 
— 4nl/^ /xlogn. 


This provides the right balance so that we can achieve the desired rate of convergence in our estimate 
of and the associated undirected graph G in the high dimensional setting. 

Given this estimate of the distribution of variable X;, we then estimate the transformation func- 
tion f; by 


à, 





F(x) = Àj 6 hj) (7) 
where E E 
ij) - e (F) 
and zi; and 6; are the sample mean and the standard deviation: 


~ lue. A E je 
cy" and oj = E-a) ? 


i-l 








1. After Charles P. Winsor, whom John Tukey credited with converting him from topology to statistics Mallows 1990. 
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Now, let S, (f) be the sample covariance matrix of f(X!)),..., f(X)); that is, 


s) = Ly (FR) - 0) (FO) onl)" (& 
i=1 

wG) = zy, 
i=1 


We then estimate Q using Sn(f). For instance, the maximum likelihood estimator is QMLE = 
S, (f). The ¢-regularized estimator is 


nN 


a, = arg min {tr (05,0) —toglo] +AA } (9) 


where A is a regularization parameter, and ||O||; = } jsx |Qjx|. The estimated graph is then En = 
{(i.k): Qu #0). 

The nonparanormal is analogous to a sparse additive regression model (Ravikumar et al., 2009a), 
in the sense that both methods transform the variables by univariate functions. However, while 
sparse additive models use a regularized risk criterion to fit univariate transformations, our nonpara- 
normal estimator uses a two-step procedure: 


1. Replace the observations, for each variable, by their respective normal scores, subject to a 
Winsorized truncation. 


2. Apply the graphical lasso to the transformed data to estimate the undirected graph. 


The first step is non-iterative and computationally efficient, with no tuning parameters; it also 
makes the nonparanormal amenable to theoretical analysis. 

Starting with the model in (2), another possibility would be to parametrize each f; according to 
some parametric class of monotone functions such as the Box-Cox family, and then find the maxi- 
mum likelihood estimates of (Q, fj, ... fp) in that class. This might lead to estimates of f; that depend 
on Q, and vice versa, and the estimation problem would not in general be convex. Alternatively, due 
to (4), the marginal information could be used to estimate the parameters. Our nonparametric ap- 
proach to estimating the transformations has the advantages of making few assumptions and being 
easy to compute. In the following section we analyze the theoretical properties of this estimator. 


5. Theoretical Results 


In this section we present our theoretical results on risk consistency, model selection consistency, 
and norm consistency of the covariance X and inverse covariance Q. From Lemma 3, the estimate 
of the graph does not depend on o;, j € (1,..., p) and u, so we assume that o; = 1 and u = 0. Our 
key technical result is an analysis of the covariance of the Winsorized estimator defined in (6), (7), 
and (8). In particular, we show that under appropriate conditions, 


max Sn F) ik — Saf) ik| = op(1) 


where S,( f) jk denotes the (j,k) entry of the matrix. This result allows us to leverage the recent 
analysis of Rothman et al. (2008) and Ravikumar et al. (2009b) in the Gaussian case to obtain 


consistency results for the nonparanormal. More precisely, our main theorem is the following. 
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Theorem 4 Suppose that p = n5 and let f be the Winsorized estimator defined in (7) with 8, = 


——À—————. D 
4n!/4 J/rlogn eine 
48 
C = —_( v24-1)( 2). 10 
M Jit (M +2) (10) 
For some M > 2 ($4- 1). 
log plog? 
Then for any € > Cy I and sufficiently large n, we have 
n 


P (max |s Pu- al >2e) 


] 2exp | 21 nie’ 2exp | 21 ns (1) 
——————-2ex o + 2ex og p— +o(1). 
2 \/tlog(np) 5 i 12321? log?n p ER 8tlogn 











The proof of the above theorem is given in Section 7. The following corollary is immediate, and 
specifies the scaling of the dimension in terms of sample size. 


Corollary 5 Let M > max{157,2€ +1}. Then 


~ log plog*n 
Su(Px—SalF)x| > 2C yf ETE. | = o(1). 

~ log plog? n 
SG) - SG) = OE m 


The following corollary yields estimation consistency in both the Frobenius norm and the 42- 
operator norm. The proof follows the same arguments as the proof of Theorem 1 and Theorem 2 
from Rothman et al. (2008), replacing their Lemma 1 with our Theorem 4. 





P | max 
ik 


Hence, 





max 
jJ 


For a matrix A = (aij), the Frobenius norm ||- ||r is defined as ||A||p = Eija The 4%- 
operator norm ||- ||; is defined as the magnitude of the largest eigenvalue of the matrix, ||A||; = 
max.4,—1 ||Ax||2. In the following, we write an = by if there are positive constants c and C indepen- 


dent of n such that c < ay/by € C. 


Corollary 6 Suppose that the data are generated as X?) ~ NPN (uo, Xo, fo), and let Qo = D JF 
the regularization parameter Àn is chosen as 


log plog?n 


Àn x 2Cy 2,12 


where Cy is defined in Theorem 4. Then the nonparanormal estimator Q, of (9) satisfies 


lÊ, — clle = Op y (rpo yoe 
E 








jua 
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and 


s(log plog? n) 


|, — Qo||2 = Op ui ; 


where 
s z Card(((i, j) € {1,...,p} x {1,..., p) Qo(i, j) #0, i 2 j]) 
is the number of nonzero off-diagonal elements of the true precision matrix. 


To prove the model selection consistency result, we need further assumptions. We follow 
Ravikumar (2009) and let the p? x p? Fisher information matrix of Xj be r = Xo Q Xo where & 
is the Kronecker matrix product, and define the support set S of Qo = Xo l as 


Sz {(i,j) € {1,...,p} x {1,..., p} | Qo(i, j) 20]. 


We use S° to denote the complement of S in the set {1,...,p} x {1,...,p}, and for any two 
subsets T and T’ of {1,...,p} x {1,...,p}, we use Irr to denote the sub-matrix with rows and 
columns of T indexed by T and T’ respectively. 


Assumption 1 There exists some o. € (0, 1], such that |Us-s(Ess) ||, < 1 — oc. 


As in Ravikumar et al. (2009b), we define two quantities Ky, = |[Eo||.; and Kr = || (Ess) tle. 
Further, we define the maximum row degree as 


n 


Assumption 2 The quantities Kyo and Ky are bounded, and there are positive constants C such that 


log?^n 
n!/2 





min |Qo(j,k)| > C 
mn o(j,k)| 2 


for large enough n. 


The proof of the following corollary uses our Theorem 4 in place of Equation (12) in the analysis 
of Ravikumar et al. (2009b). 


Corollary 7 Suppose the regularization parameter is chosen as 


log plog?n 


Àn x 2Cm 21/2 


where C(M,n, p) is defined in Theorem 4. Then the nonparanormal estimator Q, satisfies 
P (s (Gn,20)) » 1—o(1) 
where G (Qn, Q0) is the event 


{sign (a(k) = sign (Qo(j,k)), Yj,k€ s} , 
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Our persistency (risk consistency) result parallels the persistency result for additive models 
given in Ravikumar et al. (2009a), and allows model dimension that grows exponentially with sam- 


ple size. The definition in this theorem uses the fact (from Lemma 11) that sup, ^! (F, (x)) < 


v2logn when 8, = 1/(4n!/4 J/rlogn). 
In the next theorem, we do not assume the true model is nonparanormal and define the popula- 
tion and sample risks as 


R(,Q) = $ {u [ET] -tosio| — plog(2n)} 


REO) = 4 (w[08,(/)]-10g|0|—plog(2n)}. 


Theorem 8 Suppose that p < e for some & « 1, and define the classes 


M, 
Cn 


[f :R—>R : f is monotone with ||f||.. < C logn } 
{9 : Q7! x Z4]. 


Let Q, be given by 


a~ 


Q, = argmin (rr (05,0) -log|o|] : 


Then 





ER eu logn 
R(f,OQ,- inf R(f,Q) = Op | Ly 
vate (f.QeM?eG, Vas) e | m 


Hence the Winsorized estimator of (f,Q) with 6, = 1/(4n'/^ J/xlogn) is persistent over C, when 
L,—o (n0 -3/2 / vlogn). 


The proofs of Theorems 4 and 8 are given in Section 7. 


6. Experimental Results 


In this section, we report experimental results on synthetic and real data sets. We mainly compare 
the /;-regularized nonparanormal and Gaussian (paranormal) models, computed using the graphical 
lasso algorithm (glasso) of Friedman et al. (2007). The primary conclusions are: (1) When the data 
are multivariate Gaussian, the performance of the two methods is comparable; (ii) when the model 
is correct, the nonparanormal performs much better than the graphical lasso in many cases; (iii) for 
a particular gene microarray data set, our method behaves differently from the graphical lasso, and 
may support different biological conclusions. 

Note that we can reuse the glasso implementation to fit a sparse nonparanormal. In particular, 
after computing the Winsorized sample covariance S,, (f). we pass this matrix to the glasso routine 
to carry out the optimization 


nN 


Q, = arg min {tr (25,(/) — log |Q| + nll} 


2305 


LIU, LAFFERTY, AND WASSERMAN 


6.1 Neighborhood Graphs 


We begin by describing a procedure to generate graphs as in (Meinshausen and Bühlmann, 2006), 
with respect to which several distributions can then be defined. We generate a p-dimensional sparse 
graph G = (V,E) as follows: Let V = {1,...,p} correspond to variables X = (X;,...,Xp). We 


associate each index j with a point go € [0, 1? where 
(o, d y 9 ~ Uniform|0, 1] 


for k = 1,2. Each pair of nodes (i, j) is included in the edge set E with probability 


ing 1 yi yl 
(Gne) = ox | - ) 
(1) | (2) 


where y; = (y; y; ) is the observation of (v(), v) and ||- ||, represents the Euclidean distance. 
Here, s = 0.125 is a parameter that controls the sparsity level of the generated graph. We restrict the 
maximum degree of the graph to be four and build the inverse covariance matrix Qo according to 





1 ifi=j 
Qo(i,j)= 4 0.245 if (i,j) €E 
0 otherwise, 


where the value 0.245 guarantees positive definiteness of the inverse covariance matrix. 
Given Qo, n data points are sampled from 


X(),..., X ~ NPN (uo, Eo, fo) 
where uo = (1.5: 15), 29 = Q9 l For simplicity, the transformation functions for all dimensions 


are the same, fı =... = fp = f. To sample data from the nonparanormal distribution, we also require 
g = f~}; two different transformations g are employed. 


Definition 9 (Gaussian CDF Transformation) Let go be a one-dimensional Gaussian cumulative 
distribution function with mean us, and the standard deviation Og,, that is, 


go(r) 2 (5). 


Og 


We define the transformation function g ; — f Pu for the j-th dimension as 


oz) - [eve (2) ar 


gj) = oj : ; Huj 
y Slow- foo (S24) ar) e (i) 


where 6; = Xo(Jj, j). 
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Figure 3: The power and cdf transformations. The densities are estimated using a kernel density 
estimator with bandwidths selected by cross-validation. 


Definition 10 (Symmetric Power Transformation) Let go be the symmetric and odd transformation 


given by 
golt) = sign(t)|r* 


where Q > 0 is a parameter. We define the power transformation for the j-th dimension as 


(z;) 86; go(z; — uj) 
8j(Zj) = 0j y / 8o (22) 


These transformation are constructed to preserve the marginal mean and standard deviation. In 
the following experiments, we refer to them as the cdf transformation and the power transforma- 
tion, respectively. For the cdf transformation, we set us, = 0.05 and Op, = 0.4. For the power 
transformation, we set Q = 3. 

To visualize these two transformations, we sample 5000 data points from a one-dimensional nor- 
mal distribution N (0.5, 1.0) and then apply the above two transformations; the results are shown in 
Figure 3. It can be seen how the cdf and power transformations map a univariate normal distribution 
into a highly skewed and a bi-modal distribution, respectively. 
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n — 200 
Figure 4: Regularization paths for the glasso and nonparanormal with n — 500 (top) and n — 200 
(bottom). The paths for the relevant variables (nonzero inverse covariance entries) are 
plotted as solid (black) lines; the paths for the irrelevant variables are plotted as dashed 
(red) lines. For non-Gaussian distributions, the nonparanormal better separates the rele- 
vant and irrelevant dimensions. 


To generate synthetic data, we set p — 40, resulting in (2) +40 = 820 parameters to be es- 
timated, and vary the sample sizes from n = 200 to n = 1000. Three conditions are considered, 
corresponding to using the cdf transform, the power transform, or no transformation. In each case, 
both the glasso and the nonparanormal are applied to estimate the graph. 
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6.1.1 COMPARISON OF REGULARIZATION PATHS 


We choose a set of regularization parameters A; for each A € A, we obtain an estimate Q, which 
is a 40 x 40 matrix. The upper triangular matrix has 780 parameters; we vectorize it to get a 
780-dimensional parameter vector. A regularization path is the trace of these parameters over all 
the regularization parameters within A. The regularization paths for both methods are plotted in 
Figure 4. For the cdf transformation and the power transformation, the nonparanormal separates the 
relevant and the irrelevant dimensions very well. For the glasso, relevant variables are mixed with 
irrelevant variables. If no transformation is applied, the paths for both methods are almost the same. 


6.1.2 ESTIMATED TRANSFORMATIONS 


For sample size n = 1000, we plot the estimated transformations for three of the variables in Figure 
5. Itis clear that Winsorization plays a significant role for the power transformation. This is intuitive 
due to the high skewness of the nonparanormal distribution in this case. 








































































































































































































cdf power linear 
w J 
= —— estimated pap m ^ 
*— true — estimated "d — estimated 2 
e *-|-:- tne we J |>= true 
rz] 
o 
rc r NT = =| 
= = D 
o 
[z] 
1o fe J 
S 
1 
Z 
a o M 
- Y - E 
' T T T T T T T T T T 
-5 0 5 10 -1 0 2 3 4 5 
x1 x1 x1 
"n 
~ — estimated ar Nd 
rs += true — estimated a — estimated ad 
= To | += tis pa = | += true 
ri 
S 
a 4 4 
e E ò 
o 
o 
1o im E 
S 
1 z^ 
-` z` 
z Tomar T 
i T T T T T T T T T T T T T T T T 
-2 -1 0 1 2 -10 -5 0 5 10 -2 -1 0 1 2 3 4 5 
x2 x2 x2 
u 
u — estimated ane 
m *- true — estimated Qa —— estimated T2 
2 +- |-— true "d 4|-- true Ki 
ro 
o 
a 4 4 
e 2 $ 
o 
o 
rz] e J 
S 
l d 
Sa qy j a 
i : 











x3 











x3 

















Figure 5: Estimated transformations for the first three variables. Winsorization plays a significant 
role for the power transformation due to its high skewness. 
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Figure 6: Boxplots of the oracle scores for n — 1000,500,200 (top, center, bottom). 


6.1.3 QUANTITATIVE COMPARISON 


To evaluate the performance for structure estimation quantitatively, we use false positive and false 
negative rates. Let G = (V, E) be a p-dimensional graph (which has at most (5) edges) in which there 


are |E| = r edges, and let G = (V, E^) be an estimated graph using the regularization parameter A. 
The number of false positives at A is 


FP(A) = number of edges in £^ not in E 
The number of false negatives at À is defined as 

FN(A) = number of edges in E not in £^. 
The oracle regularization level À* is then 


À* = arg min (FP(À) + FN(A)}. 
AEA 


The oracle score is FP(A*) + FN(A*). Figure 6 shows boxplots of the oracle scores for the two 
methods, calculated using 100 simulations. 
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To illustrate the overall performance of these two methods over the full paths, ROC curves are 


shown in Figure 7, using 


The curves clearly show how the performance of both methods improves with sample size, and that 
the nonparanormal is superior to the Gaussian model in most cases. 
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Figure 7: ROC curves for sample sizes n = 1000, 500, 200 (top, middle, bottom). 











Let FPE = FP(A*) and FNE = FN(A*), Tables 1, 2, and 3 provide numerical comparisons of 
both methods on data sets with different transformations, where we repeat the experiments 100 
times and report the average FPE and FNE values with the corresponding standard deviations. It's 
clear from the tables that the nonparanormal achieves significantly smaller errors than the glasso if 
the true distribution of the data is not multivariate Gaussian and achieves performance comparable 
to the glasso when the true distribution is exactly multivariate Gaussian. 

Figure 8 shows typical runs for the cdf and power transformations. It's clear that when the 
glasso estimates the graph incorrectly, the mistakes include both false positives and negatives. 
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Nonparanormal glasso 
n FPE (sd(FPE)) NE (sd(FNE)) FPE (sd(FPE) FNE  (sd(FNE)) 
1000 0.10 (0.3333) 0.05 (0.2190) 3.73 (2.3904) 724 (3.2910) 
900 0.18 (0.5389) 0.16 (0.4197) 3.31 (2.4358) 8.94 (3.2808) 
800 0.16 (0.5069) 0.23 (0.5659) 3.80 (2.9439) 9.91 (3.4789) 
700 0.26 (0.6295) 0.43 (0.7420) 3.45 (2.5519) 12.26 (3.5862) 
600 0.33 (0.6039) 0.41 (0.6371) 3.31 (2.8804) 14.25 (4.0735) 
500 0.58 (0.9658) 1.10 (1.0396) 3.18 (2.9211) 17.54 (4.4368) 
400 0.71 (1.0569) 1.52 (1.2016) 1.58 (2.3535) 21.18 (4.9855) 
300 1.37 (1.4470) 2.97 (2.0123) 0.67 (1.6940) 23.14 (5.0232) 
200 2.03 (1.9356) 7.13 (3.4514) 0.01 (0.1000) 24.03 (4.9816) 


Table 1: Quantitative comparison on the data set using the cdf transformation. For both FPE and 
FNE, the nonparanormal performs much better in general. 


























Nonparanormal glasso 
n FPE (sd(FPE)) FNE  (sd(FNE) FPE (sd(FPE) FNE (sd(FNE)) 
1000 0.27 (0.7086) 0.35 (0.6571) 2.89 (1.9482) 4.97 (2.9213) 
900 0.38 (0.6783) 0.41 (0.6210) 2.98 (2.3697) 5.99 (3.0467) 
800 025 (0.5751) 0.73 (0.8270) 4.10 (2.7834) 6.39 (3.3571) 
700 0.69 (0.9067) 0.90 (1.0200) 4.42 (2.8891) 8.80 (3.9848) 
600 0.92 (1.2282) 1.59 (1.5314) 4.64 (3.3830) 10.58 (4.2168) 
500 1.17 (1.3413) 2.56 (2.3325) 4.00 (2.9644) 13.09 (4.4903) 
400 1.88 (1.6470) 4.97 (2.7687) 3.14 (3.4699) 17.87 (4.7750) 
300 2.97 (2.4181) 7.85 (3.5572) 1.36 (2.3805) 21.24 (4.7505) 
200 2.82 (2.6184) 14.53 (4.3378) 0.37 (0.9914) 24.01 (5.0940) 


Table 2: Quantitative comparison on the data set using the power transformation. For both FPE and 
FNE, the nonparanormal performs much better in general. 


6.1.4 COMPARISON IN THE GAUSSIAN CASE 


The previous experiments indicate that the nonparanormal works almost as well as the glasso in 
the Gaussian case. This initially appears surprising, since a parametric method is expected to be 
more efficient than a nonparametric method if the parametric assumption is correct. To manifest 
this efficiency loss, we conducted some experiments with very small n and relatively large p. For 
multivariate Gaussian models, Figure 9 shows results with (n, p,s) = (50,40, 1/8), (50, 100, 1/15) 
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Nonparanormal glasso 
n FPE (sd(FPE) FNE (sd(FNE)) FPE (sd(FPE)) FNE  (sd(FNE)) 
1000 0.10 (0.3333) 0.05 (0.2190) 0.09 (0.3208) 0.06 (0.2386) 
900 0.24 (0.7537) 0.14 (0.4025) 0.22 (0.6447) 0.15 (0.4113) 
800 0.17 (0.4277) 0.16 (0.3949) 0.16 (0.4431) 0.19 (0.4191) 
700 0.25 (0.6871) 0.33 (0.8534) 0.29 (0.8201) 0.27 (0.7501) 
600 0.37 (0.7740) 0.36 (0.7456) 0.36 (0.7722) 0.37 (0.6459) 
500 028 (0.5874) 0.46 (0.7442) 0.25 (0.5573) 0.45 (0.6571) 
400 0.55 (0.8453) 1.37 (1.2605) 0.47 (0.7713) 1.35 (1.2502) 
300 124 (1.3715) 3.07 (1.7306) 0.98 (1.2058) 3.04 (1.8905) 
200 1.62 (1.7219) 5.89 (2.7373) 1.55 (1.6779) 5.62 (2.6620) 


Table 3: Quantitative comparison on the data set without any transformation. The two methods 
behave similarly, the glasso is slightly better. 


and (30,100,1/15). From the mean ROC curves, we see that nonparanormal does indeed behave 
worse than the glasso, suggesting some efficiency loss. However, from the corresponding boxplots, 
the efficiency reduction is relatively insignificant. 


6.1.5 THE CASE WHEN p >n 


Figure 10 shows results from a simulation of the nonparanormal using cdf transformations with n = 
200, p = 500 and sparsity level s = 1/40. The boxplot shows that the nonparanormal outperforms 
the glasso. A typical run of the regularization paths confirms this conclusion, showing that the 
nonparanormal path separates the relevant and irrelevant dimensions very well. In contrast, with the 
glasso the relevant variables are “buried” among the irrelevant variables. 


6.2 Gene Microarray Data 


In this study, we consider a data set based on Affymetrix GeneChip microarrays for the plant Ara- 
bidopsis thaliana, (Wille et al., 2004). The sample size is n = 118. The expression levels for each 
chip are pre-processed by log-transformation and standardization. A subset of 40 genes from the 
isoprenoid pathway are chosen, and we study the associations among them using both the para- 
normal and nonparanormal models. Even though these data are generally treated as multivariate 
Gaussian in the previous analysis (Wille et al., 2004), our study shows that the results of the non- 
paranormal and the glasso are very different over a wide range of regularization parameters. This 
suggests the nonparanormal could support different scientific conclusions. 


6.2.1 COMPARISON OF THE REGULARIZATION PATHS 


We first compare the regularization paths of the two methods, in Figure 11. To generate the paths, 
we select 50 regularization parameters on an evenly spaced grid in the interval [0.16, 1.2]. Although 
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Figure 8: Typical runs for the two methods for n — 1000 using the cdf and power transformations. 
The dashed (black) lines in the symmetric difference plots indicate edges found by the 
glasso but not the nonparanormal, and vice-versa for the solid (red) lines. 


the paths for the two methods look similar, there are some subtle differences. In particular, variables 
become nonzero in a different order, especially when the regularization parameter is in the range 
À € [0.2,0.3]. As shown below, these subtle differences in the paths lead to different model selection 
behaviors. 


6.2.2 COMPARISON OF THE ESTIMATED GRAPHS 


Figure 12 compares the estimated graphs for the two methods at several values of the regularization 
parameter A in the range [0.16,0.37]. For each A, we show the estimated graph from the nonpara- 
normal in the first column. In the second column we show the graph obtained by scanning the full 
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Figure 9: For Gaussian models, comparison of boxplots of the oracle scores and ROC curves for 
small n and relatively large p. The ROC curves suggest some efficiency loss of the non- 
paranormal; however, the corresponding boxplots indicate this loss is insignificant. 
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Figure 10: For the cdf transformation with n — 200, p — 500,s — 1/40, comparison of the boxplots 
and a typical run of the regularization paths. The nonparanormal paths separate the 
relevant from the irrelevant dimensions well. For the glasso, the relevant variables are 
"buried" in irrelevant variables. 


regularization path of the glasso fit and finding the graph having the smallest symmetric difference 
with the nonparanormal graph. The symmetric difference graph is shown in in the third column. The 
closest glasso fit is different, with edges selected by the glasso not selected by the nonparanormal, 
and vice-versa. Several estimated transformations are plotted in Figure 13, which are are nonlinear. 
Interestingly, several of the differences between the fitted graphs are related to these variables. 
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Figure 11: The regularization paths of both methods on the microarray data set. Although the paths 
for the two methods look similar, there are some subtle differences. 

7. Proofs 

We assume, without loss of generality from Lemma 3, that u; = 0 and 6; = 1 for all j = 1,...,p. 

Thus, define f;(x) =~ !(F;(x)) and f;(x) = P7! (Fj(x)), and let g; = fj. 

7.1 Proof of Theorem 4 

We start with some useful lemmas; the first is from Abramovich et al. (2006). 

Lemma 11 (Gaussian Distribution function vs. Quantile function) Let ® and $ denote the distribu- 


tion and density functions of a standard Gaussian random variable. Then 


«0 <1 ao) « 90 Ef] 


and 


1 
o (®=!(m)) 


dard cur. 
a(n) = 210g (—) rev an 


Lemma 12 (Distribution function of the transformed random variable) For any & € (—%,°°) 


o7 (F; (eia isg)) =a. oer 


($7 y(m- 


Also, for * = 0.99, we have 


where r(m) € [0, 1.5]. 
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Figure 12: The nonparanormal estimated graph for three values of à = 0.2448, 0.2661, 0.30857 
(left column), the closest glasso estimated graph from the full path (middle) and the 
symmetric difference graph (right). 
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Figure 13: Estimated transformations for the microarray data set, indicating non-Gaussian 
marginals. The corresponding genes are among the nodes appearing in the symmetric 
difference graphs above. 





Proof The statement follows from 
Fj(t) = P(X; < 1) - P(sj(Z)) <1) = P(Z; < a; (0) - 9 (8710). (12) 


which holds for any f. | 
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Lemma 13 (Gaussian maximal inequality) Let W,,...,W, be identically distributed standard Gaus- 
sian random variables (do not have to be independent). Then for any & > 0 


1 
. / < ? 
E (maxim i cog” ) T n/2-1 J/2nadlogn 





Proof Using Mill’s inequality, we have 


n 
o( /alogn) 1 
P Wi> Jal «y P(w Jal )s - 
(mox 2d alen) < È lees aa Jalogn n9/2-1 ,/Inalogn 


from which the result follows. E 








Lemma 14 For any & > 0 that satisfies | — 6, —® ( V Qalog n) > 0 for all n, we have 


P [F (s; ( vatoen)) 31-8] see (1-8,-@( Vaiogn)) b. — a» 
and 
P [F (s; (- Varogn)) «&] sew[-m(i-&-e(vomen)) . —— ao 


Proof Using Hoeffding's inequality, 


=f (VaR) >) 
- r (u( Vem) -a (e (v5) 1-&- 5 (v) 
< exp -2n (1-&-5 (si (vaenn))) - 


Equation (13) then follows from equation (12). The proof of equation (14) uses the same argument. 
a 


1 
Now let M > 2 and set B = 7 We split the interval 


lei v Mlogn),g;( v/MIogn) 


into two parts, the middle 
at, (s (7 vB) e ( Visas) 
and ends 
z, = [es (= varisen) «e (- visas) ]u [e ( loan) e ( vro] 


The behaviors of the function estimates in these two regions are different, so we first establish 
bounds on the probability that a sample can fall in the end region £n. 


2318 


THE NONPARANORMAL 


2 
Lemma 15 Let A= "E. VM — \/B). Then 


P (Xij € t) <A ye, Vj € {1,...,p}. 


Proof Using Equation (12) and the mean value theorem, we have 
P (Xij E€ En) 
= P(X; € [sj vBiogn),g;( /Mlogn)]) +P (x1; € [s;( VMiogn),g;(- vBlogn)] ) 
F; (ej /Miogn)) — F; (s;( v/Blogn)) + F; (g;(— vBlogn)) — F; (s;(- v/Mlogn)) 
= 2(®( /Miogn)-®( \/Blogn)) 
26 ( VBlogn) ( v/Mlogn — VBlogn) 


The result of the lemma follows directly. a 





IA 


We next bound the error of the Winsorized estimate of a component function over the end region. 


Lemma 16 For all n, we have 


sup & (F(t) -& 1 c) < /2(M+2)logn, Vj € (1,...,p). 


t€, 


Proof From Lemma 12 and the definition of 7, we have 





SPI ))e |o. v/Mlogn|. 
Given the fact that 5 : have F;(t) € Loc. hetetore fron Hada 
iven the fact that 6, = ——7—————, we have F; —,1 —- |J. Therefore, from Equation 
^ —— 4n!/^ J/xlogn g n n E 

(11), 

sup |! (£i)| € o. v2logn) 

tcE, 
The result follows from the triangle inequality and VM + V2 < v 2(M +2). a 


Now for any € > 0, we have 


P (max 








2 fil Xy) fi (X ik) — fiX if) Fel Xx) — i) Fa) ini) fe) } 


=) 





> ze) 


n 


SeS (Pa| > 2e) 
ie i) fats k)— fj (X ij) f (Xi 9) 


bil 
max 

jk |n 
+P (max 


IA 











Un (fj) Hn ()- ufu) > e) à 
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We only need to analyze the rate for the first term above, since the second one is of higher order 
(Cai et al., 2008). Let 


Ai( j,k) = fi (Xi) fic Xin) — fj (Xiz) f Xie) 
and 
0,,(j,k) = Filt) Fels) — Fit) fis). 
We define the event Æ, as 
A,= fs; (- v/Mlogn) S codes ( V/Mlogn) es bur 


Then, by Lemma 13, when M > 2(€ + 1), we have 














1 
P(AS) <P max (Xip) > J2log(n | — 
(Fn) ML MN i) el 5) 2 /xlog(np) 
Therefore 
TE . l1 ; é 
cpl asl yen eka he 
= ik |n& "hs , 2 /nlog(np) 








Thus, we only need to carry out our analysis on the event Æ. On this event, we have the following 


decomposition: 
1 n 
P|max >E, An 
jk |n 


-Y A;(j,k) 
< P Het »» AG ) > © +P d E IG, E)» È 
Ss iJ) 4 iJo 4 








i=l 
ik n ik n 
DA OU XM XE Ma DE OU Xe, Xe, 


1 € 
+ 2P | max — ) |Ai(j,k)| > — 
ik Ny cary. 4 

ij E Mn Xik € En 


We now analyze each of these terms separately. 


log plog? 
Lemma 17 On the event Ap, let B = 1/2and£ > Cu 4| SP PE" p "then 
n 


P (ss Y ING.» j eot). 


Ben Xij €En Xi € En 
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Proof We define 
| ne 


Ss 
i 8A /logn 


with the same parameter A as in Lemma 15. Such a 0, guarantees that 


" 28r — nA zi su 
n n 








By Lemma 15, we have 


1 n 
P eX {Xij EEn Xik€ En} > 40, &) £ (f lix;ez) > 40, z) 


»( (Aix, P(Xjj€ Ka) > jo, "POJE 2) 


ne logn 
P 1 — P(X; A ; 
(É (teeter) > n E) 


Using the Bernstein’s inequality, for B = 


I 
Me 


ll 
oe 


IA 





Ms 


oe 





logn 
Y (liyer) -P (Xj € 5) >nA § 


| 
n 
iM: 
3 
T 


12 £ 
P|- lix. ; > — < P 
(GÈ {Xij EEn Xie.) 5) 


cın? Blogn 


=o(1 
arte o(1), 


< 





oO 

x 

"3 
4 7X 


where c1,c5,c3 > 0 are generic constants. 
Therefore, 


1 £ 
P|max- ) |Ai(j,k)| > = 
| Ik N y ee Xt. 4 


1 £ 
P (mt Y |Ai(j,k)| » 2,max sup |@,5(j,k)| > «) 
4 j 


Ik NY ca Xue fa * te ,sct, 


1 
+P (mp ) NGEI pmax sup |8,,(j,E)] € «) 


DF N y ce XE, Kk reset, 


^ 


jk tE En SEEn 


. 14 £ 
< P (rax sup [®;s(j,k)| > «) +P (aene > Š) 
i=l 


jk tE En, SE En 


P (rax sup [®;s(j,k)| > «) +o(1). 
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Now, we analyze the first term 


IA 


P{ max sup |@,s5(j,k)| > 6: 
Jk t€'E,,SctE, 


p^? | Sup (6; (j, K)| > «) 


t€ En SE En 


zi sup |fi(t)fe(s) — fj) G)] > «) i 


t€ En SE En 


By adding and subtracting terms f;(t) and f,(t), we have 


:( sup (ofi) n ons 91) 


t€ E Se, 
t€ En SE En 3 


< e{ sup (Fo ne» -r> Sr 


t€ En sE En 3 


e sup (o ny tie S. 





e sup (Fie) n) Ato» S. 


tE En, se E, 
The first term can further be decomposed to be 


e( sup (Fi Hoy -A> $) 


tE En sek, 
< P (spi n 5 +2 (nein nen 1) 
tek, 


SEEn 


Also, from the definition of £,, we have 


P FE) = > l'o] < /Mlogn. 
tc 


t€£, 


2 
Since € > Cy 4/ Ser we have 


0; .— nP/?g 3 Cy \V/log plog?n ig 
3 24A,/logn 24A,/logn 








(M +2)logn. 


This implies that 


01 01 
=> 2(M 4-2)l d ———— > 2(M 4-2)1 y 
V3 5 AMET Ign discere og 


Then, from Lemma 16, we get 


= (seio - m > 2) =0 
te, 
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and 
x 0; 
P| sup |(50)—f;0)l LG) > = | =0. 
t€ En SE En 
The claim of the lemma then follows directly. m 


Remark 18 From the above analysis, we see that the data in the tails doesn't affect the rate. Using 
exactly the same argument, we can also show that 


1 E€ 
P = Ai j,k — | =o(1). 
max. IA) > GJ = 00) 
Xij E Mn Xi € En 


Hog plog? 
Lemma 19 On the event A,, let B = 1/2 and € > Cy ee We have 
n 


: Ai(j,k)| > = | €2exp | 21 aiia 2exp | 21 nt 
P | max — (ik) = |S 2ex og p —- ——__—— | +2ex og p — ——— |. 
jk e 4 id 123212 log? n p Er 81logn 


Proof We have 


Pima- E AGI «e sup (ofi) n nr §) 


ik T! Xue 9M, Xe M, tE M, se M, 


< 2i sup (FO — fj() f G) fls) | > 5) 


tE M, sc Mn 12 


eae sup (o n) Ain i): 


tE M, sc M, 12 


Further, since 


sup [/j(r)| = sup |e;' (| = v/Blogn 


tc M, tc M, 

and sup — ((fj(t) — fj(t)) Us) — fe(s))| is of higher order than sup,cag, seag |(Fi(t) — AO 

t€ Mn SE Mn 

pa € 
[fi (s)|, we only need to analyze the term P | sup,ca; (AO — HON > ————— ]. 
" 12 \/Blogn 
1 
Since 6, = , using Mill’s inequality we have 





4nB/2 ,/2nBlogn 
op = OC PIGERI c. e Toss) 


2Jplogn — 





2323 


LIU, LAFFERTY, AND WASSERMAN 


This implies that 


1—6,—®( JBlogn) > 6, > 0. 


Using Lemma 14, we have 


pP (8 (e; ( VBlogn) J >1- òn) < plexp (—2n8]) — exp (217 — dies) (15) 


(16nBlogn) 
and 
1- 
owe l.l. Nc: 
p »(£, (e;( VBlogn) ) < ,) < exp c ass) : (16) 


Define an event B, as 


8, = [8, < Ê (s;( VBlogn)) x 1-8; — 1... p] 


From (15) and (16), it is easy to see that 





21/2 
P(B,) € 2exp du rim : 


From the definition of F,, we have 


2 Tx. de € 
d (spiri fe» 12 uz) 





< pP| sup |87 (Fj)) -87 E| > ——_.,, | + P(82- 
t€ 9M, 12 J/plogn 
2p D! (F(t) —&!(F,(1)) E 2exp | 21 n 
x su " (Bt) -e- |» ——— +2ex og p — ———— |. 
á her 7 : 12 J/plogn p BE 8rlogn 
Define 


Tin =max [F, (e; ( vBlogn)) A-&) and Ty, =1 - mini; (e; (- v Blogn)) bn}. 
From Equation (12) and the fact that 1 — 6, > «b ( \/ Blog n), we have that 


Tin = Ton =1— õn. 


Thus, by the mean value theorem, 


P (sp ©! (FA) — © 1 (F)()| > aie) 


< p ( (01) (max (Tin tn we [E r0 > a 


p (ea - 8) sup (FC) -50| > a 
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Finally, using the Dvoretzky-Kiefer- Wolfowitz inequality, 


P (sn D7 (£()) -27 c0) > zm] 


IA 





= e 
: E 5(0-E|* ayaan mas) 


IA 





ne? 
z ( " Haflogn ("YC y) | 


Furthermore, by Lemma 11, 


(7! (1 —8,) = : < : = Vi (s-) = 8nnP^? \/Blogn. 


—1 urs 
teen E) 
à, 
This implies that 


?p sup lo! (£i) D7! (rj) > E « 2exp | 21o nhe 
du. ; / Buten. P (7E q039 1025 J- 





In summary, we have 





P I y wG5SI»t|z2 (a nie? 2 (a Eca ) 
max — iK) T4] &2exp| 4108 p 2 + 2exp | -logp— 
nem XijE M, Xa €t, 4 123272 log“ n 8xlogn 
This finish the proof. E 


The conclusion of Theorem 4 follows from Lemma 17 and Lemma 19. 


7.2 Proof of Theorem 8 


Proof First note that the population and sample risks are 
! T 
RFQ) = 5 {tr [QE(S(X)f(X)"] -Iog|O] — plog(2) } 
a 1 
R(F,Q) = 5 {tr[QS,(f)] — log |Q| — plog(27)} . 
Therefore, for all (f,Q) € MP € C, we have 


IRF, 2) —R(f,2)| 


; rfe (84771 - 5. 0)]| 


1 
€ 5l9l,max sup [EXS Xk) — Saf) jal 
X ffe, 


Ly 
max sup [ECFj(X5) fx (Xx) — Sf) jal- 
Jk Fj REM, 


^ 


IA 
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Now, if F isa class of functions, we have 
» C Jy CIF |l... F) 
E| sup |u(g) - (2) | < —— — (17) 
(n vn 


for some C > 0, where F (x) = supgecr |g(x)|, u(g) = E(g(X)) and u(g) = n ! Y? 4e(X;) (see Corol- 
lary 19.35 of van der Vaart 1998). Here the bracketing integral is defined to be 


5 
1y(8.F) = f logus F) du 


where log Ni (£, F ) is the bracketing entropy. For the class of one dimensional, bounded and mono- 
tone functions, the bracketing entropy satisfies 


logNy (E, M ) <K (i) 


for some K > 0 (van der Vaart and Wellner, 1996). 
Now, let P, p be the class of all functions of the form m(x) = fj(x;) fi(xx) for j,k € {1,..., p}, 
where f; € M, for each j. Then the bracketing entropy satisfies 


1 
log Ni (C V/logn, Pap) < 2logp-- K (i) 


and the bracketing integral satisfies Jy (C V/logn, Py») = O( vlognlog p). It follows from (17) and 
Markov's inequality that 


lognlo logn 
max sup |$,(/) — (X) f 9)| = Op | V =" °F? | = Op LJ S 
jk fj fce M n n 


Therefore, 





A Ln Vlogn 
sup IR(f,Q) —R(f,Q)| = Op (eR . 
(f,Q)EM?P ec, a 


As a consequence, we have 


R(f',Q) 


^ 
A 


IA 
E 





IA 
e») 
x 

Q 
+ 
Q 
"v 
PES 
3 
Js 
© 
SS 


and the conclusion follows. [| 
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8. Concluding Remarks 


In this paper we have introduced the nonparanormal, a type of Gaussian copula with nonparametric 
marginals that is suitable for estimating high dimensional undirected graphs. The nonparanormal 
can be viewed as an extension of sparse additive models to the setting of graphical models. We 
proposed an estimator for the component functions that is based on thresholding the tails of the 
empirical distribution function at appropriate levels. A theoretical analysis was given to bound the 
difference between the sample covariance with respect to these estimated functions and the true 
sample covariance. This analysis was leveraged with the recent work of Ravikumar et al. (2009b) 
and Rothman et al. (2008) to obtain consistency results for the nonparanormal. Computationally, 
fitting a high dimensional nonparanormal is no more difficult than estimating a multivariate Gaus- 
sian, and indeed one can exploit existing software for the graphical lasso. Our experimental results 
indicate that the sparse nonparanormal can give very different results than a sparse Gaussian graph- 
ical model. This suggests that it may be a useful tool for relaxing the normality assumption, which 
is often made only for convenience. 
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